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60 Abstract 

In this paper, we study the nonnegative matrix factorization problem under the separability 
assumption (that is, there exists a cone spanned by a small subset of the columns of the input 
nonnegative data matrix containing all columns) , which is equivalent to the hyperspectral unmixing 
problem under the linear mixing model and the pure-pixel assumption. We present a family of fast 
recursive algorithms, and prove they are robust under any small perturbations of the input data 
matrix. This family generalizes several existing hyperspectral unmixing algorithms and hence 
provides for the first time a theoretical justification of their better practical performance. 

Keywords, nonnegative matrix factorization, algorithms, separability, robustness, hyperspectral 
| unmixing, linear mixing model, pure-pixel assumption. 

1 Introduction 

A hyperspectral image consists of a set of images taken at different wavelengths. It is acquired 
by measuring the spectral signature of each pixel present in the scene, that is, by measuring the 
reflectance (the fraction of the incident electromagnetic power that is reflected by a surface at a given 
wavelength) of each pixel at different wavelengths. One of the most important task in hyperspectral 
imaging is called unmixing. It requires the identification of the constitutive materials present in the 
image and estimation of their abundances in each pixel. The most widely used model is the linear 
mixing model: the spectral signature of each pixel results from the additive linear combination of the 
spectral signatures of the constitutive materials, called endmembers, where the weights of the linear 
combination correspond to the abundances of the different endmembers in that pixel. 

More formally, let the m-by-n matrix M correspond to a hyperspectral image with m spectral 
bands and n pixels, and where each entry My > of matrix M is equal to the reflectance of the jth 
pixel at the ith wavelength. Hence, each column rrii of M corresponds to the spectral signature of a 
given pixel. Assuming the image contains r constitutive materials whose spectral signatures are given 
by the vectors Wk £ 1 < k < r, we have, in the noiseless case, 

r 

nii = ^Wkhki, for i = 1,2, . . . ,n, 
k=l 

where hki > is the abundance of the kth endmember in the ith pixel, with Ylk=i hki = 1 Vi. Defining 
the m-by-r matrix W = [w\ W2 ■ ■ ■ Wk) — and the r-by-re matrix H with Hki = hki Vi, k, the equation 
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above can be equivalently written as M = WH where M, W and H are nonnegative matrices. Given 
the nonnegative matrix M, hyperspectral unmixing amounts to recover the endmember matrix W and 
the abundance matrix H. This inverse problem corresponds to the nonnegative matrix factorization 
problem (NMF), which is a highly ill-posed and difficult problem [23] . 

However, if we assume that, for each constitutive material, there exists at least one pixel containing 
only that material (a 'pure' pixel), then the unmixing problem can be solved in polynomial time: it 
simply reduces to identifying the vertices of the convex hull of a set of points. This assumption, referred 
to as the pure-pixel assumption [13], is essentially equivalent to the separability assumption [13]: a 
nonnegative matrix M is called separable if it can be written as M = WH where each column of W 
is equal, up to a scaling factor, to a column of M. In other words, there exists a cone spanned by a 
small subset of the columns of M containing all columns (see Section \2. II for more details). It is worth 
noting that this assumption also makes sense for other applications. For example, in text mining, 
each entry My of matrix M indicates the 'importance' of word i in document j (e.g., the number of 
appearance of word i in text j). The factors (W,H) can then be interpreted as follows: the columns 
of W represent the topics (i.e., bags of words) while the columns of H link the documents to these 
topics. Therefore, 

• Separability of M (that is, each column of W appears as a column of M) requires that, for each 
topic, there exists at least one document discussing only that topic (a 'pure' document). 

• Separability of M T (that is, each row of H appears as a row of M) requires that, for each topic, 
there exists at least one word used only by that topic (a 'pure' word). 

These assumptions often make sense in practice and are actually part of several existing document 
generative models, see [3] |3J and the references therein. 

1.1 Previous Work 

We focus in this paper on hyperspectral unmixing algorithms under the linear mixing model and 
the pure-pixel assumption, or, equivalently, to nonnegative matrix factorization algorithms under the 
separability assumption. Many algorithms handling this situation have been developed by the remote 
sensing community, see [5] for a comprehensive overview of recent hyperspectral unmixing algorithms. 
Essentially, these algorithms amount to computing the vertices of the convex hull of the (normalized) 
columns of M, or, equivalently, the extreme rays of the convex cone generated by the columns of M. 
However, as far as we know, none of these algorithms have been proved to work when the input data 
matrix M is only approximately separable (that is, the original separable matrix is perturbed with 
some noise), and many algorithms are therefore not robust to noise. However, there exists a few recent 
notable exceptions: 

• Arora et al. Section 5] proposed a method which requires the resolution of n linear programs 
in 0(n) variables (n is the number of columns of the input matrix), and is therefore not suited 
to dealing with large-scale real-world problems. In particular, in hyperspectral imaging, n cor- 
responds to the number of pixels in the image and is of the order of 10 6 . Moreover, it needs 
several parameters to be estimated a priori (the noise level, and a function of the columns of W; 
see Section [23D . 

• Esser et al. [16] proposed a convex model with n 2 variables (see also [15] where a similar approach 
is presented), which is computationally expensive. In order to deal with a large-scale real- 
world hyperspectral unmixing problem, the authors had to use a preprocessing, namely fe-means, 
to select a subset of the columns in order to reducing the dimension n of the input matrix. 
Their technique also requires a parameter to be chosen in advance (either the noise level, or a 
penalty parameter balancing the importance between the approximation error and the number 



2 



of endmembers to be extracted), only applies to a restricted noise model, and cannot deal with 
repeated columns of W in the dataset (i.e., repeated endmembers). 

• Bittorf et al. [6j proposed a method based on the resolution of a single convex optimization 
problem in n 2 variables (cf. Section f5.2p . In order to deal with large-scale problems (m ~ 10 6 , 
n ~ 10 5 ), a fast incremental gradient descent algorithm using a parallel architecture is imple- 
mented. However, the algorithm requires several parameters to be tuned, and the factorization 
rank has to be chosen a priori. Moreover, it would be impractical for huge-scale problems (for 
example for web-related applications where n ~ 10 9 ), and the speed of convergence could be an 
issue. 

1.2 Contribution and Outline of the Paper 

In this paper, we propose a new family of recursive algorithms for nonnegative matrix factorization 
under the separability assumption. They have the following features: 

• They are robust to noise (Theorem [3]). 

• They are very fast, running in approximately 6mnr floating point operations, while the memory 
requirement is low, as only one m-by-n matrix has to be stored. 

• They are extremely simple to implement and would be easily parallelized. 

• They do not require any parameter to be chosen a priori, nor to be tuned. 

• The solution does not need to be recomputed from scratch when the factorization rank is mod- 
ified, as the algorithms are recursive. 

• A simple post-processing strategy allows us to identifying outliers (Section [3|) . 

• Repeated endmembers are not an issue. 

• Even if the input data matrix M is not approximately separable, they identify r columns of M 
whose convex hull has large volume (Section 14. ip . 

To the best of our knowledge, no other algorithms share all these desirable properties. The weak point 
of our approach is that the bound on the noise to guarantee recovery is weaker than in [3j [6] ; see 
Section 12.41 Also, we will need to assume that the matrix W is full rank, which is not a necessary 
condition for the approaches above [3"1 I16t 16]. However, in practice, this condition is satisfied in most 
cases. At least, it is always assumed to hold in hyperspectral imaging and text mining applications, 
otherwise the abundance matrix H is typically not uniquely determined; see Section 12.11 Moreover, 
in Section 15.21 our approach will be shown to perform in average better than the one proposed in [B] 
on several synthetic datasets. 

The paper is organized as follows. In Section [21 we introduce our approach and derive an a priori 
bound on the noise to guarantee the recovery of the pure pixels. In Section [31 we propose a simple 
way to handle outliers. In Section UJ we show that this family of algorithms generalizes several hyper- 
spectral unmixing algorithms, including the successive projection algorithm (SPA) [2], the automatic 
target generation process (ATGP) [20], the successive volume maximization algorithm (SVMAX) |llj . 
and the p-norm based pure pixel algorithm (TRI-P) [1]. Therefore, our analysis gives the first the- 
oretical justification of the better performances of this family of algorithms compared to algorithms 
based on locating pure pixels using linear functions (such as the widely used PPI [7] and VCA [19] 
algorithms) which are not robust to noise. This was, until now, only experimentally observed. Finally, 
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we illustrate these theoretical results on several synthetic datasets in Section [5j 

Notation. Given a matrix X, Xk, X± or X(:,k) denotes its fcth column, and X^-, Xik or Xk(i) the 
entry at position (i, k) (ith entry of column k). For a vector x, Xi or x(i) denotes the ith entry of 
x. The unit simplex in dimension n is denoted A n = {x 6 M n | x > 0,^™ =1 Xj < 1}. We use the 

MATLAB notation [A 5] = ( ,4 £ ) and [A; B] = ( ^ \ . Given a matrix W G M mxr , m > r, we 

denote <7i(W) the singular values of W in non-decreasing order, that is, 

cri(W) > a 2 (W) > ■ ■ ■> a r (W) > 0. 

The m-by-ra all- zero matrix is denoted mX n while the n-by-n identity matrix is denoted I n (the 
subscripts m and n might be discarded if the dimension is clear from the context). 

2 Robust Recursive NMF Algorithm under Separability 

In this section, we analyze a family of simple recursive algorithm for NMF under the separability 
assumption; see Algorithm [TJ Given an input data matrix M and a function /, it works as follows: 

Algorithm 1 Recursive algorithm for separable NMF 

Input: Separable matrix M = WH £ M™ xn (see Assumption [1]) , the number r of columns to be 

extracted, and a strongly convex function / satisfying Assumption 
Output: Set of indices J such that M(:, J) = W up to permutation and scaling; cf. Theorems Q] and [3J 

1: Let R = M, J = {}, j = 1. 
2: while R ^ and j < r do 
3: j* = argmaxj- f(R-.j). f 

4: '', />':/•• 

5: R^(l- ^V) R- 

6: J = JU{j*}. 

7: 3=3 + 1- 

8: end while 

f In case of a tie, we pick an index j such that the corresponding column of the original matrix M 
maximizes /. 



at each step, the column of M maximizing the function / is selected, and M is updated by projecting 
each column onto the orthogonal complement of the selected column. 

In Section 12. 1\ we discuss the assumptions on the input separable matrix M = WH and the 
function / that we will need in Section [22] to prove that Algorithm [1] is guaranteed to recover columns 
of M corresponding to columns of the matrix W. Then, we analyze Algorithm Q] in case some noise is 
added to the input separable matrix M, and show that, under these assumptions, it is robust under 
any small perturbations; see Section [2.31 Finally, we compare our results with the ones from [31E] in 
Section El 

2.1 Separability and Strong Convexity Assumptions 

In the remainder of the paper, we will assume that the original data matrix M = WH is separable, 
that is, each column of W appears as a column of M up to a scaling factor. Recall that this condition is 
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implied by the pure-pixel assumption in hyperspectral imaging; see Introduction. We will also assume 
that the matrix W is full rank. This is often implicitly assumed in practice otherwise the problem is 
in general ill-posed, because the matrix H is then typically not uniquely determined; see, e.g., [31 12T]. 

Assumption 1. The separable matrix M E ]j mxn can fr e written as 

M = WH = W[I r ,H'\, 

where W E flj mxr fi as rank r, H E M^_ xn ; and the sum of the entries of each column of H' is at most 
one, that is, Y^k=i^kj — 1 or > equivalently, h'j E A r Vj. 

The assumption on matrix H is made without loss of generality by 

1. Permuting the columns of M so that the first r columns of M correspond to the columns of W 
(in the same order). 

2. Normalizing M so that its columns sum to one (except for its zero columns). In fact, we have 
that 

M = WH MD^ = WD^(D W HD~^), 



where 

(Dxh 



if i = j and X-i ^ 0, 
1 if i = j and X-i = 0, 
otherwise. 



By construction, the columns of MD~j^ and WD^} sum to one (except for the zero columns 
of M), while the columns of (DyyHDTf) have to (except for ones corresponding to the zero 
columns of M which are equal to zero) since M = WH. 

In the hyperspectral imaging literature, the entries of each column of matrix H are typically 
assumed to sum to one, hence Assumption [1] is slightly more general. This has several advantages: 

• It allows the image to contain 'background' pixels with zero spectral signatures, which are present 
for example in hyperspectral images of objects in outer space (such as satellites). 

• It allows us to take into account different intensities of light among the pixels in the image, e.g., 
if there are some shadow parts in the scene or if the angle between the camera and the scene 
varies. Hence, although some pixels contain the same material(s) with the same abundance(s), 
their spectral signature could differ by a scaling factor. 

• In the noisy case, it allows us to take into account endmembers with very small spectral sig- 
nature as noise, although it is not clear whether relaxing the sum-to-one constraint is the best 
approach [5]. 

Remark 1. Our assumptions actually do not require the matrix M to be nonnegative, as W can be 
any full-rank matrix. In fact, after the first step of Algorithm [7J the residual matrix will typically 
contain negative entries. 

We will also need to assume that the function / in Algorithm [1] satisfies the following conditions. 

Assumption 2. The function f : R m — > R + is strongly convex with parameter [i > 0, its gradient is 
Lipschitz continuous with constant L, and its global minimizer is the all-zero vector with /(0) = 0. 
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Notice that, for any strongly convex function g whose gradient is Lipschitz continuous and whose 
global minimizer is x, one can construct the function f(x) = g(x + x) — g(x) satisfying Assumption 
In fact, /(0) = while f(x) > for any x since g(x + x) > g{x) for any x. Recall that a function is 
strongly convex with parameter [i if and only if it is convex and for any x, y G dom(/) 

f(5x + (l-5)y) < 5f(x) + (l-5)f(y)-^5(l-5)\\x-y\\l for any 5 G [0, 1]. 

Moreover, its gradient is Lipschitz continuous with constant L if and only if for any x, y G dom(/) 

\\Vf(x)-Vf(y)\\ 2 <L\\x-y\\ 2 . 
Convex analysis also tells us that if / satisfies Assumption [2] then, for any x,y, 

f( x ) + Vf(x) T (y-x) + ^\\x-y\\ 2 2 < f(y) < f(x) + Vf(xf(y - x) + || \x - y\ & 
In particular, taking x = 0, we have, for any y G M m , 

flMli < f(v) < §IMII, (i) 

since /(0) = and V/(0) = (because zero is the global minimizer of /). 
2.2 Noiseless Case 

We now prove that, under Assumption [T] and [2J Algorithm [T] recovers a set of indices corresponding 
to the columns of W. 

Lemma 1. Let Y = [W, m x(r-k)] = [w\W2 ■■■ wj, mx ( r _ fe )] G M™ xr with W{ ^ Vi and u>j / tOj 
Vz ^ j, and Zei / : M m — > R+ 6e a strongly convex function with /(0) = 0. TTien 

f(Yh) < max. f(wi) for all h G A r suc/i i/iai /i 7^ e^Vj, 
where e,- is i/ie j'i/i column of the identity matrix. 

Proof. By assumption on /, we have f{w) > for any w 7^ 0; see Equation ([1]). Hence, if K/i = 0, 
we have the result since f(Yh) = < f(wi) for all i. Otherwise Yh = X^=i w ihi where hi ^ for at 
least one 1 < i < k so that 

(k / k \ \ k 

h tWi + 1 - hi < K) < max /(«*). 

8=1 V 1=1 J J 1=1 

The first inequality is strict since h 7^ efij and hi 7^ for at least one 1 < i < k, and the second 
follows from the fact that Y2i=i hi < Y2i=i hi <1- □ 

Theorem 1. Let the matrix M = WH satisfy Assumption^ and the function f satisfy Assumption^ 
Then Algorithm^ recovers a set of indices J such that, up to permutation, M(:, J) = W . 

Proof. Let us prove the result by induction. 

First step. Lemma [1] applies since / satisfies Assumption [2] while W is full rank. Therefore, the 
first step of Algorithm [1] extracts one of the columns of W . Assume without loss of generality the last 
column w r of W is extracted, then the first residual has the form 



1 ~ ^W) M=(l- p£) WH = [W mxl ]H = WUH, 

\\ w r\\2/ \ \\ w r\\2/ 



6 



i.e., the matrix is obtained by projecting the columns of M onto the orthogonal complement of 
w r . We observe that = W^H satisfies the conditions of Lemma Q] as well because W is full 
rank since W is, and the matrix H is unchanged. This implies, by Lemma [H that the second step of 
Algorithm [T] extracts one of the columns of W . 

Induction step. Assume that after k steps the residual has the form R^ = [W* O mX k]H with 
W* full rank. Then, by Lemma HJ the next extracted index will correspond to one of the columns 
of W* (say, without loss of generality, the last one) and the next residual will have the form R = 
[W\O mx ( k+1) ]H where W"t full rank since W* is, and H is unchanged. By induction, after r steps, 
we have that the indices corresponding to the different columns of W have been extracted and that 
the residual is equal to zero (R = mxr H). □ 



2.3 Adding Noise 

In this section, we analyze how perturbing the input data matrix affects the performances of Algo- 
rithm [TJ We are going to assume the input perturbed matrix M' can be written as M' = M + N 
where M is the noiseless original separable matrix satisfying Assumption [U and N is the noise with 
H^ilb < e for all i for some sufficiently small e > 0. 

2.3.1 Analysis of a Single Step of Algorithm [1] 

Given a matrix W, we introduce the following notations: j(W) = min^j \\wi — Wj\\2, v{W) = 
minj ||wi||2, u)(W) = min {^(VF), ^tC^O}) and K(W) = maxj | 2 . 

Lemma 2. Let Y = [W,Q] where W £ R mxfe a nd Q G W nx( - r ~ k \ and let f satisfy Assumption® 
with strong convexity parameter n and its gradient having Lipschitz constant L. If 

u(W) > 2J - K(Q) , 



then, for any < 5 < \, 



satisfies 



f* = max f(Yx) such that Xi < 1 — 6 for 1 < i < k, 

xeA r 



f* <^f{ Wi )-\ ^{1-8)5 uj(W) 2 . (3) 



Proof. By strong convexity of /, the optimal solution x* of ([2]) is attained at a vertex of the feasible 
domain {i£l r | xi > OVi, YH=i x i < ^-i x i < ^ ~ 81 <i < k}, that is, either 



(a) 


x* 


(b) 


X* 


(c) 


X 


(d) 


* 

X 


(c) 


X 



(1 - 5)ej for 1 < j < k, 

5ei + (1 — 5)ej for 1 < i,j < k, or 

5ei + (1 — 5)ej for k + 1 < i < r and 1 < j < k. 
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Before analyzing the different cases, let us provide a lower bound for /*. Using Equation ([I]), we have 

f((l - 6) Wi ) >±fx(l - 6) 2 \\w i \\l 

Since (1 — S)wi is a feasible solution and < 5 < |, this implies /* > ^K(W) 2 . Recall that since / is 
strongly convex with parameter /i, we have 

/ (6y + (1 - S)z) < 6f(y) + (1 - S)f(z) - ±1*6(1 - 6)\\y - z\\\. 
Let us now analyze the different cases. 

(a) Clearly, x* ^ since f(0) =0 and f(y) > for all y / 0, cf. Equation 

(b) Yx* = qi for some i. Using Equation ([1]), we have 

r = /(«) < §ii*ni < §^(Q) 2 < ^{wf < ^Kiwf < r, 

since v(W) > 2*/j-K(Q), a contradiction. 



(c) Yx* = (1 — 6)wi for some i : 

f* <{l-5)f{w i )- l -^{l-5)\\w l \\l (4) 
By strong convexity, we also have f(wi) > f > ^(1 — 6)\\wi\\2- Plugging it in fl3J) gives 

/* < f(w i )-5f(w i )-\^{l-5)\\w i \\l < f{w i )-ii6{l-6)\\w i \\l < m^f( Wi )-^(l-5)u(Wf. 



(d) Yx* = 5wi + (1 — 5)wj for some i ^ j : 

- 6)f(wA - 



f < Sf(wi) + (1 - 6)f( Wj ) - ^6(1 - S)\\wi - Wj \\l 



1 



< max/K) - - 6) I 

(e) Yx* = 5qi + (1 — 6)vjj for some i, j. First, we have 

/* < + (1 - *)/K0 - \»6(1 - 5)\\ qi - Wj \\l 

< /K) + - Sf( Wj ) - \n6(l - S)(u(W) - K{Q)f 
In fact, \\qi —Wj\\2 > H^jlh — llftlk > z'(W) — K(Q) > 0. Then, using 

/(«) < §||%HI < ±K(Q) 2 < \^{Wf < \f( Wj ), 

and v(W) - K(Q) > (l - hJ~*z) u ( w ) > \ v { w )> we obtain 

3„„, . 1 



/* < f(wj) - - A 6f{ Wj ) - -^(1 - 5)u(W) 2 
1 



<f(w j )-^6(l-6)u(W) 2 , 

since f(wj) > % (1 - 6)u(W) 2 . (Note that the last inequality can be improved by replacing i 
with | ( 1 — k/ ^ J . However, since < ^ < 1, we have ^ < l _ |yf < 1] and this could 



only improve the second term in the bound by a constant factor; at the limit, from | to |.) 



□ 

Remark 2. If Q = in Lemma\^ case (e) in the proof reduces to case (c) so that the bound ([3]) can 
be improved to 

f* < max f(wi) -n (1-6)8 co(W) 2 . 

i 

Lemma 3. Let the function f : M m — > R + satisfy Assumption with strong convexity parameter fx 
and its gradient having Lipschitz constant L. Then, for any \\x\\2 < K and \ \n\\2 < e < K, we have 

L 2 \ . _ 3 



/(z + n) </(*) + UKL + -e z 1 <f(x) + -eKL, and (5) 

f(x + n)> f{x) - (eKL - ^e 2 ) > f(x) - eKL. (6) 

Proof. For the upper bound flSJ), we use the fact that the gradient of / is Lipschitz continuous with 
constant L 

f{x + n)< fix) + Vfixfn + ^\\n\\ 2 < f(x) + eKL + ^e 2 < f(x) + ^eKL, 

for any ||x||2 < K, \\n\\2 < e < i'T. The second inequality follows from the fact that V/(0) = and 
by Lipschitz continuity of the gradient: ||V/(x) — 1 1 2 < L\\x — 0| 1 2 < LK for any ||x||2 < K. 
For the lower bound ([6]), we use strong convexity 

fix + n) >/(x) + V/(x) T n+|||n|| 2 > - i^L||n|| 2 + |||n|| 2 > f(x) - (eKL-^e 2 

for any 1 1 a? 1 1 2 < K, ||n||2 < e < K. The third inequality follows from the fact that 

max (yKL - -y 2 ) = eKL - -e 2 , 

since K > e and L > fi. □ 

We can now prove the theorem which will be used in the induction step to prove that Algorithm [1] 
works under small perturbations of the input separable matrix. 

Theorem 2. Let 

• / satisfy Assumption [2 with strong convexity parameter \i, and its gradient have Lipschitz 
constant L. 

• M' = M + N with M = YH = [W Q]H, where W £ R mxk , K(N) < e, u{W) > 2^K(Q), 

and H = [I, H'] G R^_ xn where the sum of the entries of the columns of H' is at most one. We 
will denote uj(W) and K (W), oj and K respectively. 

2 

• e be sufficiently small so that e < ^okl • 

Then the index i corresponding to a column of M' that maximizes the function f satisfies 

mi = [W, Q]hi, where hi(p) > 1 — 5 for some 1 < p < k, and 5 = — , (7) 

which implies 

|K-«;p||a < e + 2K5 = el 1 + 20— ^- J . (8) 
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Proof. First notice that e < 2 qkl i m P nes = 1( ^J> L < \- Let us then prove Equation ([7]) by 
contradiction. Assume the extracted index, say i, for which = mi + rtj = Yhi + satisfies 
/ii(Z) < 1 - 5 for 1 < I < k. We have 

f(. m 'i) = f( m i + n i) 

< f(Yhi) + ^eKL 

< ( max f(Yx) s.t. x(l) <l-5f or KKM + -eKL 
\ v seA'' ~ ~ ~ / 2 

1 3 

< max f(wj) — -p5(l — 5)oj 2 + -eKL 



2' 

< max f[w'j) -\n8{l-8)u 2 + ^eKL, (9) 



1 ... „ 2 5 



where u/- is the perturbed column of M corresponding to (that is, the jth column of M). The 
first inequality follows from Lemma [3l In fact, we have e < K since fj, < L and a; < K, ||mj||2 = 
||jy/ij||2 < maxj 1 1 i 1 1 2 = K (by convexity of ||.||2)> and 1 1 rx^ 1 1 2 < e so that f(m'j) < f(jrii) + \eKL. 
The second inequality is strict since the maximum is attained at a vertex with x(l) = 1 — 5 for some 
1 < I < k at optimality (see proof of Lemma [2]) • The third inequality follows from Lemma [2] while the 
fourth follows from the fact that 1 1 ifj^ 1 1 2 < K so that f(wj) — eKL < f(w'j) for all j by LemmaO 
We notice that, since 5 < ^, 

-fi5{l - 5)u > -fjw 5 = -auj — = -eKL. 

2 4 4 pu z 2 

Combining this inequality with Equation ([U]), we obtain f[m'j) < maxj f(w'j), a contradiction since 
m\ should maximize / among the columns of M' and the w'j's are among the columns of M' . 
To prove Equation (JSj) , we use Equation ([7]) and observe that 

rrii = (1 — 5')w p + OikUk for some p and 1 — 5' > 1 — 5, 

k^p 

so that X^fc^p a k < < 8. Therefore, 

\\m - w p \\ 2 = 

which gives 



5'w p + ^2 a k w k 

k^p 



< 25' max \\wj\\ 2 < 25' K < 2K5, 

j 

2 



\m. 



\ — Wp\\2 < \\{fn'i — mi) + (mj — w p ) 1 12 < e + 2K5, for some 1 < p < k. 



□ 

Notice that this result is invariant to the scaling of matrix M' since multiplying M' by a constant 
amounts to multiply e, K{W) and u(W) by the same constant. 

Remark 3. If p = L (see Section \4.1\ ), the bounds obtained in Theorem^ can be slightly improved. In 
fact, using the tighter bounds from Equations ([5]) and ([6]) ; the bound of Equation ([9]) can be improved 
to 

/K) < nwx/K-) - \^5{l - 5)u 2 + [eKL + ^e 2 ) + (eKL - |e : 

= max/K-) - -fi5(l - 5)oo 2 + 2eKL. 
j 2 
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Hence, we only need 



. 8KLe 1 , u 2 u 
o = — ^ — < — , that is, e < 



uj 2 fi ~ 2' ' 16KL' 

so that for the extracted index i, there exists some p such that 

( K 2 L 

IK -w p \\ 2 <e + 2Kb = e 1 + 16-=-- 

V w /■*, 

Remark 4. If Q = 0, we can ge£ rid o/ the factor ^ (see .Remark HP so that we only need 

. 5KLe 1 , u?ix 
= — 7) — < — , that is, e < 



w 2 /i - 2' ' 10KL' 

: tiitu 

be improved to 



and e + 2K5 = e ( 1 + 10-^- — J . Combining this result with Remark^ above, if jx = L, the bound can 



t AKLe 1 , lj 2 u 
d = — ^ — < — , that is, e < 



u 2 n ~ 2' ' 8KL' 



and e + 2K5 = e (1 + 8^) . 



It is interesting to relate the ratio ^9ppP to the condition number of matrix VF, given by the ratio 
of its largest and smallest singular values k(W) = °J^X ■ 

Lemma 4. Let W = [w\ w 2 ■ ■ ■ w r ] £ M mxr . Then 

w(W) > a r (W). 

Proof. We have to show that 

H^ilb — cr r (W) for all i, and \\u)i — Wj\\2 > V2a r (W), for all % / j. 

Let (U, X, V) G M. mxr x IR rxr x M rxr be a compact singular value decomposition of W = ITEV T , where 
f7 and V are orthonormal and £ is diagonal with the singular values of W on the diagonal. Then 

||«>i||2 = ||^S^|| 2 = ||SUt||2 > M^OINb = 0- r (W), 

while 

\\wi - Wj\\ 2 = \\UH{vi - Vj)\\ 2 = ||S(uf - w 3 -)||2 > £r r (W)||ui - Vj\\ 2 = V2a r (W). 

□ 

The ratio -^j^y is then closely related to the conditioning of matrix W £ W mxr . In fact, we have 
seen that u(W) > a r (W) while, by definition, ai(W) > K{W) > v(W) > oj(W) so that 



" u(W) ~ a r (W) - K[Wy 
In particular, this inequality implies that if k(W) = 1 then -^jfj- = 1- 
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2.3.2 Error Bound for Algorithm Q] 

We have shown that, if the input matrix M' has the form 

M' = [W,Q][I r ,H'] + N, 

where Q and N are sufficiently small and the sum of the entries of each column of H ' > is smaller 
than one, then Algorithm [T] extracts one column of M' which is close to a column of W; cf. Theorem^ 
We now show that, at each step of Algorithm [H the residual matrix satisfies these assumptions so 
that we can prove the result by induction. 
We first give some useful lemmas. 

Lemma 5 (Cauchy Interlacing Theorem). Let W G ]R mxr an d u £ jj m w %tk \\u\\ 2 = 1. Then 

a\{W) > ai((I - uu T )W) > tr r _i((7 - uu T )W) > a r {W). 

Lemma 6 (Singular Value Perturbation, Weyl). Let M' = M + N £ M. rxn with r < n. Then, for all 
1 < i < r, 

\ui{M) - Ui{M')\ < a^N) = 



Lemma 7. Let W = [Wi, w r ] G R mxr where W\ G M mx ( r - 1 ) and w r G M m . Let also P = [I 

complement of u with \ \ u 

CT r _i(PVFi) > a r (W)-e 



be the projection onto the orthogonal complement of u with \ \u — uvlh < e for some e > 0. Then, 



Proof. We have 

<r r _i(/Wi) = cJ r _ 1 ([P^i,0 mx i]) 

> a r - 1 ([PW 1 ,Pw r ]) - \\Pw r \\ 2 = a r -i{PW) - \\Pw r \\ 2 

> a r (W)-e. 

The first inequality follows from Lemma El and the second follows from Lemma [5] and the fact that 
H-Pwvlh < e since \\u — w r \\2 < e. □ 

We can now prove the main theorem of the paper which shows that, given a noisy separable 
matrix M' = M + N = WH + iV where M satisfies Assumption [H Algorithm Q] is able to identify 
approximately the columns of W . 

Theorem 3. Let M' = M + N = WH + N G R mxn where M satisfies AssumptionUlwith W G M mxr , 
r > 2, and H = [L r H'\, and let f satisfy Assumption^ with strong convexity parameter [i and its 
gradient has Lipschitz constant L. Let also ||^i||2 < e f or all i with 

£<MW)m ^_L_,^( 1+8 „W)- 1 , m 

and J be the index set of cardinality r extracted by Algorithm^ Then there exists a permutation P 
of {1, 2, . . . , r} such that 

( K(W) 2 L\ 
max \\m J{j) - Wp(j) \\ 2 <e = e(l + SO^y- ) . 
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Proof. Let us prove the result by induction. First, let us define the residual matrix obtained after 
k steps of Algorithm Q] as follows: 

= M' , R {k+ V = p( k )R( k ) for 1 < k < r - 1, 

with = — j^jj?) is the orthogonal projection performed at step 5 of Algorithm [1] where u is 

the extracted column of R^ k \ that is, u = r- for some 1 < i < n. 

Then, let us assume that the residual R^ G W nxn has the following form 

R (k) = ^(fc) s Q(fc)]# + 



where G R" 1 ^^^ q(*) g r™x^ K{Q^) < e, ^(H^W) > and < e < 



Let us show that R^ k+1 ^ satisfies the same conditions as R^ k K By assumption, satisfies the 

conditions of Theorem [2} the {k + l)th index extracted by Algorithm [H say i, satisfies 



\rf ] -w^\\ 2 < e 1 + 20 



K(W^) 2 L\ 



for some 1 < p < r — k. Let us assume without loss of generality that p = r — k. The next residual 
R( k+1 ) has the form 

R (k+i) = p {k) R {k) = [p%^. ) ^^i) i P^w {k \ pWQW]H + p( fc y*0 , (11) 

Q(fc+1) N(k+1) 

where 

• ^(iV^ 1 )) < K(N { V) < e and K(P^Q^) < K(QW) < I because an orthogonal projection 
can only reduce the ^-norm of a vector. 

• l|P (fc) 4 fc) || 2 < e - = e (l + 80f^f) since 

\\p( k ) w ( k )\\o < \\r {k) - w^\\o <e(l + 2Q K ( Wik) ^-} 
\\r ^112^11^ w p || 2 _e ^i + zu^^ 

because P^w { p k) is the projection of onto the orthogonal complement of and by Theo- 
rem [21 and 

K{W^ f < K(W) 2 
uj(W( k )) 2 ~ a 2 {W) ' 

In fact, we have that K(W^) < K(W) because of the orthogonal projections, while u>(W^) > 
^oy(W) follows from 

u(W {k) ) > a r ^ k {W {k) ) > o-r-fc+iCW^ -1 )) - e > a r (W) - ke > u r {W) - (r - l)e > ~a r (W). 

The first inequality follows from Lemma 01 the second and third from Lemma [71 while the last 
follows from 



2(r - 1) - 2(r - 1) V o?(W0 A* 

For to satisfy the same conditions as R^ k \ it remains to show that u{W^ k+1 ^) > and 

e < 20K(w , ( fc +i))L ' us snow * na, t * n i s holds for all k = 0, 1, . . . , r — 1 : 
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Since u(W^) > w(W<®) > ha r (W) (see above), u(W^) > 2, Me is implied by 



Since w(W<*>) > |<r r (W) and K (W<M) < K{W) (see above), e < 2 okxh/W)l * s i m P n ed by 



2(r-l)V a?(W)iiJ ~ ; 80K(IU)L' 

By assumption on the matrix M', these conditions are satisfied at the first step of the algorithm 
(we actually have that Q(°> is an empty matrix), so that, by induction, all the residual matrices satisfy 
these conditions. Finally, Theorem [2] implies that the index i extracted by Algorithm [1] at the (fc + l)th 
step satisfies 

rf = \WW QW]/H + n u where hi (p) > 1 - &\ *W = , 

and p = r — k without loss of generality (see above). Since the matrix H is unchanged between each 
step, this implies mj = Wh{ where hi(p) > 1 — 5^ k \ hence 

\\m'i - w p \\2 = \\m'i — mi + rrii — w p \\2 
< e + \ \rrii - w p \\2 
<e + 25 (k) K(W) 

<e(l + 80^ 2L 



a*(W) n) ' 

The last inequality follows from K(W^) < K(W) and u(W^) > \cr r (W) (see above). □ 

Remark 5 (Improving the Bound). It is possible to improve the bounds of Theorem^ although we 
believe it is unfortunately not possible to get rid of the factor r. For example, at the first step, we have 
w(W(°)) > a r {W) ande = so that |K J(1) - w P(1) || 2 < e = e (l + 10^^ jjY- see Remark^ 

At the last step, W^' 1 ^ is made up of one (non-zero) column so that ~r^F(^Tyj = 1- Hence, for 

the last extracted index, we actually have Wm'j^ — ^p^lb — e + 80 ^^ j^j (cf. last equation of 
the proof of Theorem^). 

Also, in case \x = L, the constant 80 can be replaced with 64; see Remark^ 



2.4 Bounds for Separable NMF and Comparison with the Algorithms of Arora et al. 
and Bittorf et al. © 

Given a noisy separable matrix M' = M + N = WH + N, we have shown that Algorithm [1] is 
able to approximately recover the columns of the matrix W (Theorem [3]) . Using this result, it is 
straightforward to obtain an approximate NMF (U, V) > of M' ~ UV. 
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Theorem 4. Let M' = M + N = WH + N € M™ xn , e = K(N) and f satisfy the conditions of 
Theorem^ Let also J be the index set of cardinality r extracted by Algorithm^ U = M'(:, J), and 

V = argmin XgR rxn \\M' — UX\\p. 
Then, (U, V) > is an NMF of M' satisfying 

max \\m'i - U Vi \\ 2 < e ( 2 + 80 . (12) 

l<i<n \ 0-f(W) [I ) 

Proof. By Theorem [3j there exists a permutation matrix P £ {0, l} rxr such that 



max \\Ui - Wpi 2 < e = e 1 + 80 
Since V = argmin Xg]K rxn ||M' — J7X||i?, we have, for alH = 1, 2, . . . , n, 

'Im'j - J7fj||2 < - C//ii|| 2 



< ||m- - WP^ + WP/ii - Uhi\\ 2 
<e + \\(WP-U)hi\\ 2 <e + e. 

The last inequality follows from the fact that hi £ A r so that, by convexity of ||.||2> ||(^P — U)hi\\2 < 
maxj \\Wpi - Ui\\ 2 < e. □ 

In order to compare the bound (|12p with the ones obtained in [31 [6], let us briefly recall the 
results obtained in both papers: Given a separable matrix M = WH, the constant a is defined as the 
minimum among the l\ distances between each column of W and its projection onto the convex hull 
of the other columns of W. Without loss of generality, it is assumed that the l\ norm of the columns 
of W is equal to one (by normalizing the columns of M), hence a < 2. Then, given that the noise N 
added to the separable matrix M satisfies 

ei = max||n,||i < 0(a 2 ), 

i 

Arora et al. [3] generate an NMF (U, V) of the noisy separable matrix M' = WH + N such that 

max \ \m'i — E/vdh < 10— + 7ei, 
i a 

while Bittorf et al. [6 J provide an NMF (U, V) satisfying 

max \ \m'i — Uvi\\i < 4ex- 

i 

Let us compare these bounds to ours. Since the l\ norm of the columns of W is equal to one, 
we have a r (W) < K(W) < 1. Moreover, denoting pi the orthogonal projection of w% onto the linear 
subspace generated by the columns of W but the ith, we have 

a r (W) < min \ \wi — Pi\\ 2 < min \ \wi — Pi\\i < ct. 

i i 

By Theorem [31 Algorithm [1] therefore requires 

e 2 = max 2 < O — — - < O — 
i V r / \ r 
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to obtain an NMF (U, V) such that 



max 1 1 m i 

i 



UviWi < O 



£2 



o*(W) 



This shows that the above bounds are tighter, as they only require the noise to be bounded above by 
a constant proportional to a 2 to guarantee an NMF with error proportional to ^ or ei. In particular, 
if W is not full rank, Algorithm [1] will fail to extract more than rank(VF) columns of W, while the 
value of a can be much larger than zero implying that the algorithms from [3j [6] will still be robust 
to a relatively large noise. 

To conclude, the techniques in [3j [6] based on linear programming lead to better error bounds. 
However, there are computationally much more expensive (at least quadratic in n, while Algorithm [T] 
is linear in n, cf. Section fl.ip . and have the drawback that some parameters have to be estimated in 
advance: the noise level ei, and 

• the parameter a for Arora et al. [3] (which is rather difficult to estimate as W is unknown), 

• the factorization rank r for Bittorf et al. [60, hence the solution has to be recomputed from 
scratch when the value of r is changed (which often happens in practice as the number of 
columns to be extracted is typically estimated with a trial and error approach). 

Table [1] summarizes these results. Note that we keep the analysis simple and only indicate the growth 
in terms of n. The reason is threefold: (1) in many applications (such as hyperspectral unmixing), n 
is much larger than m and r, (2) a more detailed comparison of the running times would be possible 
(that is, in terms of m, n, and r) but is not straightforward as it depends on the algorithm used to 
solve the linear programs (and possibly on the parameters a and ei), and (3) both algorithms [3, 6J 
are at least quadratic in n (for example, the computational cost of each iteration of the first-order 
method proposed in [B] is proportional to mn 2 , so that the complexity is linear in m). 



Table 1: Comparison of robust algorithms for separable NMF (columns of the input matrix M are 
assumed to be normalized so that a r (W) < a) 





Arora et al. [3] 


Bittorf et al. [6] 


Algorithm Q] 


Flops 


Q(n 2 ) 


Q(n 2 ) 


0(n) 


Memory 


<D(n) 


0(n 2 ) 


0{n) 


Noise 


ei < 0(a 2 ) 


ei < 0(a 2 ) 




Error 


Mi < oft) 


ll-lli < 0{e x ) 


\Yh<o{^) 


Parameters 


ei, a 


e 1} r 


r 



3 Outlier Detection 

It is important to point out that Algorithm Q] is very sensitive to outliers, as are most algorithms 
aiming to detect the vertices of the convex hull of a set of points, e.g., the algorithms from [31 [6] 
discussed in the previous section. Therefore, one should ideally discard the outliers beforehand, or 

1 In their incremental gradient descent algorithm, the parameter e does not need to be estimated. However, other 
parameters need to be tuned, namely, primal and dual step sizes. 
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design variants of these methods robust to outliers. In this section, we briefly describe a simple way 
for dealing with (a few) outliers. This idea is inspired from the approach described in |15| . 

Let us assume that the input matrix M is a separable matrix WH = W[I, H'\ satisfying Assump- 
tion [T] to which was added t outliers gathered in the matrix T G R mxt : 



where b! i G A r for all i, hence g[ G A r for all i. Assuming [W, T] has rank r + 1 (hence t < m — r), the 
matrix M above also satisfies Assumption [TJ In the noiseless case, Algorithm [1] will then extract a set 
of indices corresponding to columns of W and T (Theorem [1]) . Therefore, assuming that the matrix 
H' has at least one non-zero element in each row, one way to identifying the outliers would be to 

(1) Extract r + t columns from the matrix M using Algorithm [TJ 

(2) Compute the corresponding optimal abundance matrix G, and 

(3) Select the r columns corresponding to rows of G with the largest sum, 

see Algorithm[2] It is easy to check that Algorithm[2]will recover the r columns of W because the opti- 

Algorithm 2 Recursive algorithm for separable NMF with outliers 

Input: A separable matrix M G R™ xn containing at most t < m — r outliers, the number r of columns 

to be extracted, and a strongly convex function / satisfying Assumption [2j 
Output: Set of indices J' such that M(:, J') = W up to permutation and scaling; cf. Theorem [5j 

1: Compute J, the index set of cardinality (r + t) extracted by Algorithm [TJ 

2: G = argmin a .. 6Ar+ t Vi \\M - M(:, J)X\\%. 

3: Compute J' C J, the set of r indices having the largest score \\G(j, :)||i among the indices in J. 

mal solution G computed at the second step is unique (since [W, T] is full rank), hence ||G(j,:)||i > 1 f° r 
the indices corresponding to the columns of W while ||G(j,:)||i = 1 for the outliers; see Equation (fT3|) . 

In the noisy case, a stronger condition is necessary: the sum of the entries of each row of H' must 
be larger than some bound depending on the noise level. In terms of hyperspectral imaging, it means 
that for an endmember to be distinguishable from an outlier, its abundance in the image should be 
sufficiently large, which is perfectly reasonable. 

Theorem 5. Let M' = [W,T,WH'\ + N e M. mxn where [W,WH'] satisfies AssumptionUlwith W G 
i mxr , T G M mxt , r > 2, and let f satisfy Assumption^ with strong convexity parameter fi and its 
gradient has Lipschitz constant L. Let also denote B = [W,T], s = r+t, and 1 1 rx^ 1 1 2 < e for all i with 



M = [W, T, WH'] 




(13) 



= [W,T][L r+t ,G'] = [W,T]G, 




and J be the index set of cardinality r extracted by Algorithm^ Lf 



2(2n-t-r)—-^-<\\H'{i,:)\\ 1 for all i 



(14) 



o- s {LI) 

then there exists a permutation P of {1,2, ... , r} such that 




17 



Proof. By Theorem[3j the columns extracted at the first step of Algorithm [2] correspond to the columns 
of W and T up to error e. Let then W + Nw an d T + Nt be the columns extracted by Algorithm [T] 
with K(N W ),K(N T ) < e. 

At the second step of Algorithm [2J the matrix G is equal to 

G = argmin 2/ . 6Ar+tVi \\M' - [W + N W ,T + iV T ]^|||, 

up to the permutation of its rows. It remains to show that 

mm ||G(*,:)||i > max \\G(i, :)||i, (15) 

l<j<r r+l<i<r+t 

so that the last step of Algorithm [5] will identify correctly the columns of W among the ones extracted 
at the first step. We are going to show that 



G 



I r rxt H' 

Oixr It Ojxr 



More precisely, we are going to prove the following lower (resp. upper) bounds for the entries of the 
first r (resp. last t) rows of G: 



For 1 < i < r, G y > max (0, - 2^§y J for all j. 



• For r + 1 < % < r + t, G tj < min (1, F, L j + 2^yJ for all j. 
Therefore, using the condition (|14p . we obtain 

min ||G(i,:)||i> (l - 2-^- )+ \\H% :)||i - 2(n ' 



e + e 



> l + 2(n - 1) 



o-JB) 



> ,max ||G(i,:)||i, 

which proves the result. 

It remains to provide the bounds for the entries of G. First, we have that 



m 



, j -[W + N w ,T + N T ]G :j || a < e + e, for all j, (16) 



which can be proved using the same derivations as in the proof of Theorem Then, let us prove the 
upper bound for the block of matrix G at position (2, 1), that is, let us prove that 

Gn < 2 € + € , for all r + 1 < i < r + 1 and 1 < j < r. 
os{B) 

(Note that < G < 1 by construction, hence some of the bounds are trivial, e.g., for the block (1,2).) 
The derivations necessary to obtain the bounds for the other (non-trivial) blocks are exactly the same 
and are then omitted here. Let r + 1 < i < t and 1 < j < r and denote Gy = 5, and let also 
I = {1, 2, . . . , r + t}\{i}. We have 

Wm'j - [W + N W ,T + N T ]G :j \\ 2 = \\(wj + n 3 ) + B(:, I)G(I,j) + B(:, i)S + [N w , N T ]G :j \\ 2 

> \\ Wj + B(:,I)G(I,j) + B(:,i)5\\ 2 - e - e 

> min 8\\B(:,I)y-B(:,i)\\ 2 -e-e 



> 5<r s (B)-e-e. 

The first inequality follows from K(N) < e, K([N W ,N T \) < e and G :j £ A r+t , while the second 
inequality follows from the fact that Wj is a column of B(:,I). The last inequality follows from the 
fact that the projection of any column of B onto the subspace spanned by the other columns is at 
least o s {B). Finally, using Equation (fTUj) . we have e + e > 5a s (B) — e — e, hence Gy = 5 < 2 ff £ ^) • D 
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4 Choices for the Function / and Related Methods 



In this section, we discuss several choices for the function / in Algorithm [JJ and relate them to existing 
methods. 

4.1 Best Choice with fi = L: f(x) = ||a:||| 

According to our derivations (see Theorem , using functions / whose strong convexity parameter \i 
is equal to the Lipschitz constant L of its gradient is the best possible choice (since it minimizes the 
error bounds). The only function satisfying Assumption [5] along with this condition is, up to a scaling 
factor, 

m 

f( x ) = Nil = ^xj. 

i=i 

In fact, Assumption [2] implies tjH^HI < f{x) < see Equation flJJ. However, depending on the 

problem at hand, other choices could be more judicious (see Sections 14.21 and l4.3|) . 

It is worth noting that Algorithm [1] with f(x) = has been introduced and analyzed by several 
other authors: 

• Choice of the Reflections for QR Factorizations. Golub and Businger [5] construct QR 
factorizations of matrices by performing, at each step of the algorithm, the Householder reflection 
with respect to the column of M whose projection onto the orthogonal complement of the 
previously extracted columns has maximum ^2-norm. 

• Successive Projection Algorithm. Araujo et al. [2J proposed the successive projection al- 
gorithm (SPA), which is equivalent to Algorithm [1] with f(x) = H^Hl- They used it for vari- 
able selection in spectroscopic multicomponent analysis, and showed it works better than other 
standard techniques. In particular, they mention 'SPA seems to be more robust than genetic 
algorithms' but were not able to provide a rigorous justification for that fact (which our analysis 
does). This algorithm has been used successfully by other authors for hyperspectral unmixing; 
see, e.g., |26j. 

Ren and Chang [2D] rediscovered the same algorithm, which was referred to as the automatic 
target generation process (ATGP). It was empirically observed in [25] to perform better than 
other hyperspectral unmixing techniques (namely, PPI [7J and VCA pj5]). However, no rigorous 
explanation of their observations was provided. In Section [5j we describe these techniques and 
explain why they are not robust to noise, which theoretically justifies the better performances 
of Algorithm [TJ 

Chan et al. [TTJ analyzed the same algorithm (with the difference that the data is preprocessed 
using a linear dimensionality reduction technique). The algorithm is referred to as the succes- 
sive volume maximization algorithm (SVMAX). They also successfully use Algorithm [TJ as an 
initialization for a more sophisticated approach which does not take into account the pure-pixel 
assumption. 

• Greedy Heuristic for Volume Maximization. Ci vrn an d Magdon-Ismail QJJ] showed 
that Algorithm [TJ with f(x) = is a very good greedy heuristic for the following problem: 
given a matrix M and an integer r, find a subset of r columns of M whose convex hull has 
maximum volume. More precisely, unless V = MV, they proved that the approximation ratio 
guaranteed by the greedy heuristic is within a logarithmic factor of the best possible achievable 
ratio by any polynomial-time algorithm. However, the special case of separable matrices was 
not considered. This is another advantage of Algorithm [TJ even if the input data matrix M is 
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not approximately separable, it identifies r columns of M whose convex hull has large volume. 
For the robust algorithms from [3J [B] discussed in Section 12.44 it is n °t clear whether they will 
be able to produce a meaningful output; see also Section 1531 for some numerical experiments. 



4.2 Generalization: fix) = J2i=i M x » 



Given a one-dimensional function h 
separable function fix) = YliLi h( x i, 
\\x\\ 2 .). For example, we could take 



: R — > M + satisfying Assumption [21 it is easy to see that the 
also satisfies Assumption [2] (notice that h(y) = y 2 gives f(x) = 



Hy) 



y 



a + \y\ 



a > 0, hence f(x) = 



.77 



a + \Xi 



This choice limits the impact of large entries in x, hence would potentially be more robust to outliers. 
In particular, as a goes to zero, f(x) converges to ||x||i while, when a goes to infinity, it converges to 

(in any bounded set). 
Lemma 8. In the ball {y G M. \ \y\ < K}, the function 

y 2 

Hy) = — — , a > o, 



a + \y\ 



is strongly convex with parameter \i = ^q^p and its gradient is Lipschitz continuous with constant 



L 



Proof. On can check that 



2a 2 



(a + Kf 



< h"(y) 



2a' 



(a + \y\ 



< 



a 



for any \y\ < K , 



hence [i 



2a* 



{a+K)i ' and L 



□ 



For example, one can choose a = K for which we have — = 2, which is slightly larger than one but 
is less sensitive to large, potentially outlying, entries of M. Let us illustrate this on a simple example: 



M' 



( 2 

2 
1 

V o 



2 + e \ 
0.5 
2 

1.5 
0.5 / 



/ 2 2 \ 




2 
1 

V o 



i 

2 
2 

1 / 



0.5 

1 0.5 



+ 



/ o 





e 





































Vo 








) 



WH + N. 



(17) 



a?||| recovers the first two columns 



One can check that, for any e < 0.69, Algorithm [T] with f(x) 
of M, that is, the columns of W. However, using f(x) = i+\x \ ' Algorithm [T] recovers the columns 
of for any e < 1.15. Choosing appropriate function f(x) depending on the input data matrix and 
the noise model is a topic for further research. 

Remark 6 (Relaxing Assumption [2]) . The condition that the gradient of f must be Lipschitz continu- 
ous in Assumption^ can be relaxed to the condition that the gradient of f is continuously differentiable. 
In fact, in all our derivations, we have always assumed that f was applied on a bounded set (more 
precisely, the ball {x \ ||x||2 < K(W)}). Since g G C 1 implies that g is locally Lipschitz continuous, 
the condition f G C 2 is sufficient for our analysis to hold. Similarly, the strong convexity condition 
can be relaxed to local strong convexity. 
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It would be interesting to investigate more general classes of functions for which our derivations 
hold. For example, for any increasing function g : M + — > R ; the output of Algorithm [7] using f or 
using (go f) will be the same, since f(x) > f(y) <J=^ g {f(x)) > g (f(y))- Therefore, for our analysis 
to hold, it suffices that (g o /) satisfies Assumption^ for some increasing function g. For example, 
\\x\\2 is not strongly convex although it will output the same result as \ \x\\\ hence will satisfy the same 
error bounds. 



4.3 Using fp-norms 

Another possible choice is 



f(x) = ||x" 2 



v 




For 1 < p < 2, f(x) is strongly convex with parameter 2(p — 1) with respect to the norm ||.|| p 
Section 4.1.1], while its gradient is locally Lipschitz continuous (see Remark [6]). For 2 < p < +00, 
the gradient of f(x) is Lipschitz continuous with respect to the norm ||.|| p with constant 2(p — 1) (by 
duality), while it is locally strongly convex. Therefore, / satisfies Assumption [2] for any 1 < p < +00 
in any bounded set, hence our analysis applies. Note that, for p = 1 and p = +00, the algorithm is 
not guaranteed to work, even in the noiseless case (when points are on the boundary of the convex 
hull of the columns of W): consider for example the following separable matrices 



M 



1 0.5 \ _ / 1 \ / 1 0.5 
1 0.5 / \ 1 / \ 1 0.5 




for which the £i-norm may fail (selecting the last column of M), and 

1 1 )- 

\0 1 0.5 J 

for which the ^oo-norm may fail. Similarly as in the previous section, using f p -norms with p 7^ 2 might 
be rewarding in some cases. For example, for 1 < p < 2, the ^ p -norm is less sensitive to large entries 
of M. Consider the matrix from Equation (|17p : for p = 1.5, Algorithm [T] extracts the columns of W 
for any e < 0.96, while for p = 4, it only works for e < 0.31 (recall for p = 2 we had e < 0.69). 

Algorithm [1] with f(x) = \\x\\ p has been previously introduced as the ^ p -norm based pure pixel al- 
gorithm (TRI-P) PQ, and shown to perform, in average, better than other existing techniques (namely, 
N-FINDR |24| . VCA |19| . and SGA [E]). The authors actually only performed numerical experiments 
for p = 2, but did not justify this choice (the reason possibly is that it gave the best numerical results, 
as our analysis suggests), and did not rigorously explain why Algorithm [1] performs better than other 
approaches in the noisy case. 



5 Numerical Experiments 

In the first part of this section, we compare Algorithm Q] with several fast hyperspectral unmixing 
algorithms under the linear mixing model and the pure-pixel assumption. We first briefly describe 
them (computational cost and main properties) and then perform a series of experiments on synthetic 
datasets in order to highlight their properties. For comparisons of Algorithm [1] with other algorithms 
on other synthetic and real-world hyperspectral datasets, we refer the reader to [21 [201 ESI [261 El CO 
since Algorithm [1] is a generalization of the algorithms proposed in [2 [201 EJ CO ; see Section [H 

In the second part of the section, we compare Algorithm [1] with the Algorithm of Bittorf et al. [6]. 
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5.1 Comparison with Fast Hyperspectral Unmixing Algorithms 

We compare the following algorithms: 

1. Algorithm [JJ with f(x) = 1 1 a? 1 1 § - We will only test this variant because, according to our 
analysis, it is the most robust. (Comparing different variants of Algorithm [1] is a topic for 
further research.) The computational cost is rather low: steps 3 and 5 are the only steps 
requiring computation, and have to be performed r times. We have 

• Step 3. Compute the squared norm of the columns of R, which requires n times 2m 
operations (squaring and summing the elements of each column), and extract the maximum, 
which requires n comparisons, for a total of approximately 2mn operations. 

• Step 5. It can be compute in the following way 

/ Ujuf \ u - „ 

\ \\ u j\\2 J \\ u j\\2 

where computing x T = ujR requires 2mn operations, y = ,, u ], i m operations, and R — yx T 
2mn operations, for a total of approximately 4mn operations. 

The total computational cost of Algorithm [Tj is then about 6mnr operations, plus some negligible 
terms. 

Remark 7 (Sparse Matrices). If the matrix M is sparse, R will eventually become dense which 
is often impractical. Therefore, R should be kept in memory as the original matrix M minus the 
rank-one updates. 

Remark 8 (Reducing Running Time). Using recursively the formula 

— uu T )v\\\ = \ \v ||| — (u T v) 2 for any u, v € M m with \ \u\\2 = 1, 

we can reduce the computational cost to about 2mnr + 0(mr 2 ) operations. Although this formula 
is unstable when u is almost parallel to v, this is negligible as we are only interested in the column 
with the largest norm (on all the tested datasets, including the 40000 randomly generated matrices 
below, we have always obtained the same results with both implementations). This is the version 
we have used for our numerical experiments as it turns out to be much faster (for example, on 
200-by-200 matrices with r = 20, it is about seven times faster, and on a real-world 188-by- 
47750 hyperspectral image with r = 15, about 20 times faster -taking less than half a second), 
and handles sparse matrices as it does not compute the residual matrix explicitly (for example, it 
takes about half a second for the 19949-by-43586 20-newsgroup dataset for r = 20). The code is 



available at https :// sites . google, com/ site/ nicolasgillis/ code . All experiments have 



been performed with MATLAB R2011b on a laptop with 2GHz Intel Core H-2630QM. 

2. Pure Pixel Index (PPI) [7]. PPI uses the fact that the maxima (and minima) of randomly 
generated linear functions over a polytope are attained on its vertices with probability one. 
Hence, PPI randomly generates a large number of linear functions (that is, functions f(x) = c T x 
where c G M. m is uniformly distribution over the sphere), and identifies the columns of M 
maximizing and minimizing these functions. Under the separability assumption, these columns 
must be, with probability one, vertices of the convex hull of the columns of M. Then, a score is 
attributed to each column of M: it is equal to the number of times the corresponding column 
is identified as a minimizer or maximizer of one of the randomly generated linear functions. 
Finally, the r columns of M having the largest score are identified as the columns of W. 

Letting K be the number of generated linear functions, we have to evaluate K times linear 
functions overs n vertices in dimension m for a total computational cost of O(Kmn). For our 
experiments, we will use K = 1000. There are several pitfalls in using PPI: 
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It is not robust to noise. In fact, linear functions can be maximized at any vertex of the 
convex hull of a set of points. Therefore, in the noisy case, as soon as a column of the 
perturbed matrix M 1 is not contained in the convex hull of the columns of W, it can be 
identified as a vertex. This can occur for arbitrarily small perturbation, as will be confirmed 
by the experiments below. 

If not enough linear functions are generated, the algorithm might not be able to identify all 
the vertices (even in the noiseless case) . This is particularly critical in case of ill-conditioning 
because the probability that some vertices maximize a randomly generated linear function 
can be arbitrarily low. 

If the input noisy data matrix contains many columns close to a given column of matrix 
W, the score of these columns will be typically small (they essentially share the score of 
the original column of matrix W) , while an isolated column which does not correspond to a 
column of W could potentially have a higher score than these columns, hence be extracted. 
This can for example be rather critical for hyperspectral images where there typically are 
many pixels close to pure pixels (i.e., columns of M corresponding to the same column of 
W). Moreover, for the same reasons, PPI might extract columns of M corresponding to 
the same column of W . 

3. Vertex Component Analysis (VCA) [19]. The first step of VCA is to preprocess the data 
using principal component analysis which requires 0(nm 2 + m 3 ) fl9j. Then, the core of the 
algorithm requires 0(rm 2 ) operations (see below), for a total of 0(nm 2 + m 3 ) operations^. 
Notice that the preprocessing is particularly well suited for datasets where m <C n, such as 
hyperspectral images, where m is the number of hyperspectral images with m ~ 100, while n is 
the number of pixels per image with n ~ 10 6 . 

The core of VCA is very similar to Algorithm [TJ at each step, it projects the data onto the 
orthogonal complement of the extracted column. However, instead of using a strictly convex 
function to identify a vertex of the convex hull of M (as in Algorithm [1]) , it uses a randomly 
generated linear function (that is, it selects the column maximizing the function f(x) = c T x 
where c is randomly generated, as PPI does). Therefore, for the same reasons as for PPI, the 
algorithm is not robust. However, it solves the second and third pitfalls of PPI (see point (b) 
and (c) above), that is, it will always be able to identify enough vertices, and a cluster of points 
around a vertex are more likely to be extracted than an isolated point. Note that, in the VCA 
implementation, only one linear function is generated at each step which makes it rather sensitive 
to this choice. 

Finally, the two main differences between VCA and Algorithm [1] are that: (1) VCA uses a pre- 
processing (although we could implement a version of Algorithm[T]with the same pre-processing), 
and (2) VCA uses randomly generated linear functions to pick vertices of the convex hull of the 
columns of M (which makes it non-robust to noise, and also non-deterministic). 

4. Simplex Volume Maximization (SiVM) [22J. SiVM recursively extracts columns of the 
matrix M while trying to maximize the volume of the convex hull of the corresponding columns. 
Because evaluating the volumes induced by adding a column not yet extracted to the previously 
selected ones is computationally expensive, the function is approximated via a heuristic, for a 
total computational cost of 0(mnr) operations (as for Algorithm [1]) . The heuristic assumes that 
the columns of W are located at the same distance, that is, | \wi — Wj\ I2 = \\u>k — W1W2 for all i 7^ j 
and k ^= I. Therefore, there is not guarantee that the algorithm will work, even in the noiseless 
case; this will be particularly critical for ill-conditioned problems, which will be confirmed by 
the experiments below. 

2 The code is available at http://www.lx.it.pt/~bioucas/code.htm 
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We now generate several synthetic datasets allowing to highlight the properties of the different 
algorithms, and in particular of Algorithm[TJ We are going consider noisy separable matrices generated 
as follows. 

1. The matrix W will be generated in two different ways : 

(a) Uniform Distribution. The entries of matrix W G M 200x20 are randomly and independently 
generated following an uniform distribution between and 1 (using the rand() function of 
MATLAB). 

(b) Ill-conditioned. First, we generate a matrix W' G ]j 2 00x20 ag a b ove (that is, using the rand() 
function of MATLAB). Then, we compute the compact SVD (U G R 200x2 °, £ e M 20x20 , V G 
M 20x20 ) of W = UZV T , and finally generate W = USV T where S is a diagonal matrix 
whose diagonal entries are equal to a* -1 for i = 1,2, ...,20 where a 19 = 10~ 3 (that is, 
a = 0.695) in such a way that a x {W) = 1 and a 20 (W) = 10~ 3 , hence k(W) = 1000. 

2. The matrices H and N will be generated in two different ways as well : 

(c) Middle Points. We set H = [J 2 o, H'] G R 20x210 5 where the columns of H' contain all 
possible combinations of two non-zero entries equal to 0.5 at different positions (hence H' 
has ( 2 2 °) = 190 columns). This means that the first 20 columns of M are equal to the 
columns of W while the 190 remaining ones are equal to the middle points of the columns 
of W. We do not perturb the first 20 columns of M (that is, rij = for 1 < i < 20), while, 
for the 190 remaining ones, we use 

m = 5 {rrii - w) for 21 < i < 210, 5 > 0, 

where id is the average of the columns of W (geometrically, this is the vertex centroid of 
the convex hull of the columns of W). This means that we move the columns of M toward 
the outside of the convex hull of the columns of W . Hence, for any 5 > 0, the columns 
of M' are not contained in the convex hull of the columns of W (although the rank of M' 
remains equal to 20). 

(d) Dirichlet and Gaussian Distributions. We set H = [I20, I20, H'\ G M 20x240 5 where the 
columns of H' are generated following a Dirichlet distribution whose r parameters are 
chosen uniformly in [0, 1] (the Dirichlet distribution generates vectors on the boundary 
of A r , that is, X^fc^'i(^) = 1 We perturb each entry of M independently using the 
normal distribution: 

ni(k) ~ 6Af(0, 1) for 1 < i < 240, 1 < k < 200. 

(The expected value of the ^-norm of the columns of N is y/m5, the square root of the 
expected value of a Chi-squared distribution.) Notice that each column of W is present 
twice as a column of M (in terms of hyperspectral unmixing, this means that there are two 
pure pixels per endmember). 

Finally, we construct the noisy separable matrix M' = WH + N in four different ways, see Table El 
where W, H and N are generated as described above for a total of four experiments. For each 
experiment, we generate 100 matrices for 100 different values of 5 and compute the percentage of 
columns of W that the algorithms were able to identify (hence the higher the curve, the better); see 
Figure [TJ (Note that we then have 10000 matrices generated for each experiment.) We observe the 
following 

• Algorithm Q] is the most robust algorithm as it is able to identify all the columns of W for the 
largest values of the perturbation 5 for all experiments; see Table [3] and Figure [TJ 
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Table 2: Generation of the noisy separable matrices for the different experiments and average value 
of k(W), K(W), and a r (W). 





Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


w 


(a) 


(a) 


(b) 


(b) 


N and H 


(c) 


(d) 


(c) 


(d) 


k(W) 


10.84 


10.84 


1000 


1000 


K{W) 


8.64 


8.64 


0.41 


0.41 


<Jr(W) 


2.95 


2.95 


10~ 3 


10" 3 


Average (maxj \\m 2 <5 _1 ) 


3.05 


16.15 


0.29 


16.15 



Table 3: Robustness: maximum values of 5 for perfect recovery. 





Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Algorithm 1 


0.252 


0.238 


0.011 


1.74*10 -4 


PPI 


0.126 


0.224 








VCA 





0.210 








SiVM 


0.175 












• In Exp. 1, PPI and SiVM perform relatively well, the reason being that the matrix W is well- 
conditioned (see Table [2J while VCA is not robust to any noise. In fact, as explained in Sec- 
tion 15. 1| VCA only uses one randomly generated linear function to identify a column of W at 
each step, hence can potentially extract any column of M since they all are vertices of the convex 
hull of the columns of M (recall the last columns of M correspond to the middle points of the 
columns of W and are perturb toward the outside of the convex hull of the columns of W) . 

• In Exp. 2, the matrix W is well-conditioned so that SiVM still performs well. PPI is now unable 
to identify all columns of M, because of the repetition in the dataset (each column of W is 
present twice as a column of ikfjj. VCA now performs much better because the columns of M 
are strictly contained in the interior of the convex hull of the columns of W. 

• In Exp. 3 and 4, SiVM performs very poorly because of the ill-conditioning of matrix W. 

• In Exp. 3, as opposed to Exp. 1, PPI is no longer robust because of ill-conditioning, although 
more than 97% of the columns of W are perfectly extracted for all 5 < 10~ 3 . VCA is not robust 
but performs better than PPI, and extracts more than 97% of the columns of W for all 5 < 10~ 2 
(note that Algorithm [1] does for 5 < 0.05). 

• In Exp. 4, PPI is not able to identify all the columns of W because of the repetition, while, as 
opposed to Exp. 2, VCA is not robust to any noise because of ill-conditioning. 

• Algorithm [1] is the fastest algorithm although PPI and SiVM have roughly the same computa- 
tional time. VCA is slower as it uses PCA as a preprocessing; see Table HI 

These experiments also show that the error bound derived in Theorem [3] is rather loose, which can 
be partly explained by the fact that our analysis considers the worst-case scenario. Recall that the 

3 For 8 = 0, we observe that our implementation of the PPI algorithm actually recovers all columns of W . The reason 
is that the first columns of M are exactly equal to each other and that the MATLAB max(.) function only outputs the 
smallest index corresponding to a maximum value. 
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Table 4: Average computational time in seconds for the different algorithms. 





Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Algorithm 1 


0.0080 


0.0087 


0.0086 


0.0086 


PPI 


0.052 


0.051 


0.049 


0.054 


VCA 


2.69 


0.24 


2.71 


1.41 


SiVM 


0.025 


0.027 


0.028 


0.027 



value of e in Theorem is the smallest value such that ||n.j||2 < e for all i; see Equation (|10|) . Table [2] 
gives the average value of the maximum norm of the columns of iV for each experiment. Based on 
these values, the first row of Table [5] shows the average upper bound for 5 to guarantee recovery; see 
Theorem [3] (we used the improved bound from Remark [5] when fi = L, that is, the constant 80 can be 
replaced with 64). 

Table 5: Comparison of the average value of 5 predicted by Theorem [3] to guarantee recovery for 
Algorithm Q] with f(x) = 1 1 a? 1 1 § , compared to the observed values. 





Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Th.[3] 

Observed 


4.6* 10" 5 
0.259 


8.7* 10" 6 
0.238 


8.4*10~ 12 
0.011 


1.5*10~ 13 
1.74* 10" 4 



5.2 Comparison with the Algorithm of Bittorf et al. [6] 

In this section, we compare the Algorithm of Bittorf et al. (BRRT) (Algorithm 3 in [6 J jf| with Algo- 
rithm [TJ BRRT solves the following optimization problem 

min p T diag(A) 

such that max \\M'(:,i) - M'X(:,i)\\i < 2ei, 

l<i<n 

e T diag(X) = r, (18) 
< Xij < Xa < 1, for all 

where p G M. n is any vector with distinct values, M' = WH + N is the noisy separable matrix and 
e\ = maxj 1 1 rii| |i. For each column of W, Bittorf et al. prove that there must be a diagonal entry of the 
optimal solution X* of (|18p larger than i so that the corresponding column of M is at distance at most 
2ei of that column of W, given that ei is sufficiently small (see Section T2.4p . All other diagonal entries 
of X* are smaller than i. In the experiments below, we will select the r columns of M corresponding 
to the r largest entries of the diagonal of X* . Problem (|18|) will be solved using CVX [T7] hence we are 
only able to solve small-scale problems (in fact, CVX uses an interior-point method and formulating 
Problem (| 18|) as a linear program requires 0{n 2 + mn) variables). For that reason, we perform exactly 
the same experiments as in the previous section but for very small dimensions: m = 10 and r = 5 for 
all experiments, and 

• n = 5 + (2) = 15 for the first and third experiments (the last ten columns of M are the middle 
points of the five columns of W). 

4 We do not perform a comparison with the algorithm of Arora et al. [3] as it is not very practical (the value of a has 
to be estimated, see Section \2A^i and has already been shown to perform similarly as BRRT in [6]. 
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• n = 5 + 5 + 10 = 20 for the second and fourth experiments (we repeat twice each endmember, 
and add 10 points in the convex hull of the columns of W). 

Note that, as the noise matrix N is known, the parameter e\ can be estimated exactly. 

The average running time for BRRT on these datasets using CVX [T7] is about two seconds while, 
for Algorithm^ it is less than 10 -3 seconds. (Bittorf et al. [6j propose a more efficient solver than CVX 
for their LP instances. As mentioned in Section T2.41 even with their more efficient solver, Algorithm [1] 
is much faster for large n.) Figure [5] shows the percentage of correctly extracted columns with respect 
to 5, while Table [6] shows the robustness of both algorithms. 



Table 6: Robustness: maximum values of 5 


"or perfect recovery. 




Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Algorithm [T] 
Bittorf et al. (BRRT) [6] 


0.070 

0.042 


1.4*10~ 6 
5.5*10 6 


7.9* 10" 4 
1.8*10 3 


2.4*10 6 

10" 6 



Quite surprisingly, Algorithm [T] performs in average better than BRRT. Although BRRT is more 
robust in two of the four experiments (that is, it extract correctly all columns of W for a larger value 
of 8), the percentage of columns it is able to correctly extract decreases much faster as the noise 
level increases. For example, Table [7] shows the maximum value of 5 for which 99% percent of the 
columns of W are correctly extracted. In that case, Algorithm [T] always performs better. A possible 



Table 7: Maximum values of 5 for recovering 99% of the co 





Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Algorithm [T] 
Bittorf et al. (BRRT) [6] 


0.126 
0.05 


2.8* 10" 3 
4.0* 10" 4 


1.2*10" 2 
2.2* 10" 3 


10" 4 
1.8*10" 5 



umns of W. 



explanation for this behavior is that, when the noise is too large, the condition for recovery are not 
satisfied as the input matrix is far from being separable. However, using Algorithm Q] still makes sense 
as it extract columns whose convex hull has large volume [91 [10] while it is not clear what BRRT 
does in that situation (as it heavily relies on the separability assumption). Therefore, although BRRT 
guarantees perfect recovery for higher noise levels, it appears that, in practice, when the noise level is 
high, Algorithm [1] is preferable. 

6 Conclusion and Further Work 

In this paper, we have introduced and analyzed a new family of fast and robust recursive algorithms 
for separable nonnegative matrix factorization problems which are equivalent to hyperspectral unmix- 
ing problems under the linear mixing model and the pure-pixel assumption. This family generalizes 
several existing hyperspectral unmixing algorithms, and our analysis provides a theoretical framework 
to explain the better performances of these approaches. In particular, our analysis explains why algo- 
rithms like PPI and VCA are less robust against noise compared to Algorithm 1 or other approaches 
introduced in [3J [B] . 

Finally, many questions remain open, and would be interesting directions for further research: 

• Is it possible to provide better error bounds for Algorithm [1] than the ones of Theorems [3] and 
In other words, is our analysis tight? Also, can we improve the bounds if we assume a specific 
generative model? 
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• How can we choose appropriate functions f{x) for Algorithm [T] depending on the input data 
matrix and the noise model? 

• Can we design other robust algorithms for the separable NMF problem leading to better error 
bounds? In particular, it would be interesting to find out whether Algorithm [1] is 'optimal', that 
is, does it exist an algorithm running in 0(mnr) guaranteeing better error bounds? 
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Figure 1: Comparison of Algorithm [H PPI, VCA and SiVM. From top to bottom: Exp. 1, Exp. 2, 
Exp. 3, and Exp. 4. 
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Figure 2: Comparison of Algorithm [I] and BRRT [6]. From top to bottom: Exp. 1, Exp. 2, Exp. 3, 
and Exp. 4. 
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